✅ Identify primary keys and foreign keys that link datasets
✅ Understand the different types of joins (left, right, inner, full, semi, anti)
✅ Perform joins in R using dplyr (left_join(), etc.)
✅ Recognize and fix common merge problems (duplicates, mismatched names, missing values)
✅ Distinguish between merging (columns) vs appending (rows)
✅ Apply joins to a real-world dataset (Gapminder)
✅ Debug messy joins with practical tools (anti_join(), cleaning, deduplication)
Motivation: Real analysis rarely lives in one table. Joining is the bridge that lets us answer richer questions.
Why learn merging?
It’s rare that a data analysis task involves only a single data frame. Usually, you have multiple data sets and want to join them together to answer the question you are interested in.
For example, consider wanting to determine the relationship between GDP and the number of cows in a country.
It’s rare that a dataset would have both of these things in common.
However, there are separate datasets for these things:
Duplicate keys (many-to-many joins, which may blow up the dataset unexpectedly).
Implicit joins when you forget to specify by
Common pitfalls in joining
Using janitor::clean_names(), stringr for trimming/standardizing, or countrycode for harmonizing country names.
Checking results after joins
nrow(), summary(), distinct(), count() to ensure the join behaved as expected.
Left join:
With a left join we retain all observations from x, and we add columns y. Rows in x where there is no matching key value in y will have NA values in the new columns.
Right join:
A right join is just a flipped left join where we retain all observations from y, and we add columns x. Similar to a left join, rows in y where there is no matching key value in x will have NA values in the new columns.
Should I use a right join, or a left join? To answer this, ask “which data frame should retain all of its rows?” - and use this one as the baseline. A left join keep all the rows in the first data frame written in the command, whereas a right join keeps all the rows in the second data frame.
Inner join:
An inner join matches pairs of observations whenever their keys are equal. Consequently, the output of an inner join is all rows from x where there are matching values in y, and all columns from x and y.
An inner join is the most restrictive of the joins - it returns only rows with matches across both data frames.
Full join:
We can also perform a full join where we keep all observations in x and y. This join will match observations where the key variable(s) have matching information in both tables and then fill in non-matching values as NA.
A full join is the most inclusive of the joins - it returns all rows from both data frames.
Semi join:
A semi-join keeps all observations in the baseline data frame that have a match in the secondary data frame (but does not add new columns nor duplicate any rows for multiple matches).
Anti join:
The anti join is another “filtering join” that returns rows in the baseline data frame that do not have a match in the secondary data frame.
Common scenarios for an anti-join include identifying records not present in another data frame, troubleshooting spelling in a join (reviewing records that should have matched), and examining records that were excluded after another join.
Visuals of joins:
If you want to see examples of these in R, you can see some here
Different from appending
Appending is different for merging
Merging adds columns to a dataset
Combines information row-wise
Appending adds rows to a dataset
Think of if we had 2 datasets, one with years 2000-2019, and one with 2020-2025
Gapminder is a well-known dataset that tracks global development statistics over time.
It contains a variety of indicators for most countries in the world and years from 1952 through 2007.
Examples of indicators:
Life expectancy
GDP per capita
Population
Continent
Each row represents a country in a specific year and includes these variables.
Gapminder is popular for teaching data analysis because:
It includes real-world data covering health, wealth, and demographic trends.
It is tidy, clean, and easy to join with other datasets.
Used by Hans Rosling in famous visualizations to show trends in global health and income.
Merge Examples
Note: these examples were inspired by the work of Craig Hutton and his tutorial on joins:
# start by loading in the gapminder data:library(gapminder) # constains the gapminder data, will load it in as an objectlibrary(dplyr)head(gapminder)
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
Create some different datasets to merge on:
# create a dataset of life expectancy for all non-Asian countries:life_df <- gapminder |>filter(continent !="Asia") |>select(country, year, lifeExp) |>ungroup()# create a dataset on population for all non-European countries:pop_df <- gapminder |>filter(continent !="Europe") |>select(country, year, pop) |>ungroup()# create a dataset on GDP for all countries:gdp_df <- gapminder |>select(country, year, gdpPercap) |>ungroup()
Left join:
If we wanted to add population data for each continent that appears in the life expectancy data frame, we could use the left_join():
# A tibble: 6 × 3
country year pop
<fct> <int> <int>
1 Afghanistan 1952 8425333
2 Afghanistan 1957 9240934
3 Afghanistan 1962 10267083
4 Afghanistan 1967 11537966
5 Afghanistan 1972 13079460
6 Afghanistan 1977 14880372
What columns should we merge on?
country and year
Left join:
joined_table <- life_df |>left_join( pop_df, by =c("country", "year"))head(joined_table)
# A tibble: 6 × 4
country year lifeExp pop
<fct> <int> <dbl> <int>
1 Albania 1952 55.2 NA
2 Albania 1957 59.3 NA
3 Albania 1962 64.8 NA
4 Albania 1967 66.2 NA
5 Albania 1972 67.7 NA
6 Albania 1977 68.9 NA
If the key columns have different names, you can tell the join function which columns to use with the equality operator
# create a renamed version:life_df_renamed <-rename(life_df, nation = country)joined_table_renamed <- life_df_renamed |>left_join(pop_df, #since the country column is now called "nation" in life_df, #we have to tell left_join which columns to match on.#You'll get an error if you try by = c("continent", "year") this timeby =c("nation"="country","year"))head(joined_table_renamed)
# A tibble: 6 × 4
nation year lifeExp pop
<fct> <int> <dbl> <int>
1 Albania 1952 55.2 NA
2 Albania 1957 59.3 NA
3 Albania 1962 64.8 NA
4 Albania 1967 66.2 NA
5 Albania 1972 67.7 NA
6 Albania 1977 68.9 NA
Right join:
A right join is basically the same thing as a left_join but in the other direction, where the 1st data frame (x) is joined to the 2nd one (y), so if we wanted to add life expectancy and GDP per capita data we could either use:
a right_join() with life_df on the left side and gdp_df on the right side, or
a left_join() with gdp_df on the left side and life_df on the right side
… and get the same result with only the columns arranged differently…
# try via right join:rj <- life_df |>right_join(gdp_df, by =c('year', 'country'))# try via left join:lj <- gdp_df |>left_join(life_df, by =c('year', 'country'))# arrange and select them:rj <- rj |>arrange(country, year, gdpPercap, lifeExp) |>select(country, year, gdpPercap, lifeExp)lj <- lj |>arrange(country, year, gdpPercap, lifeExp) |>select(country, year, gdpPercap, lifeExp)# look at the datahead(rj)
# A tibble: 6 × 4
country year gdpPercap lifeExp
<fct> <int> <dbl> <dbl>
1 Afghanistan 1952 779. NA
2 Afghanistan 1957 821. NA
3 Afghanistan 1962 853. NA
4 Afghanistan 1967 836. NA
5 Afghanistan 1972 740. NA
6 Afghanistan 1977 786. NA
head(lj)
# A tibble: 6 × 4
country year gdpPercap lifeExp
<fct> <int> <dbl> <dbl>
1 Afghanistan 1952 779. NA
2 Afghanistan 1957 821. NA
3 Afghanistan 1962 853. NA
4 Afghanistan 1967 836. NA
5 Afghanistan 1972 740. NA
6 Afghanistan 1977 786. NA
# check to see if they are identcial:identical(lj, rj)
[1] TRUE
Full Join:
After aligning rows by matches in the key column(s), a full join retains all rows that appear in x or y
# join the two of themfull_join_df <- life_df |>full_join(gdp_df, by =c('country', 'year')) head(full_join_df)
# check the dimensions of the full join and compare them to the left join:dim(full_join_df)
[1] 1704 4
dim(life_df)
[1] 1308 3
Inner Join:
If you want to work with data that match entries in both data sources, you would use the inner_join():
# Mean life expectancy and population:life_expect_population <- life_df |>inner_join(pop_df, by =c('country', 'year'))# check the dimensions:dim(life_expect_population)
[1] 948 4
# check summary to see that there are no NAs:summary(life_expect_population)
country year lifeExp pop
Algeria : 12 Min. :1952 Min. :23.60 Min. : 60011
Angola : 12 1st Qu.:1966 1st Qu.:44.68 1st Qu.: 2053003
Argentina: 12 Median :1980 Median :52.60 Median : 5068804
Australia: 12 Mean :1980 Mean :54.51 Mean : 14506347
Benin : 12 3rd Qu.:1993 3rd Qu.:64.91 3rd Qu.: 12431196
Bolivia : 12 Max. :2007 Max. :81.23 Max. :301139947
(Other) :876
Anti-Join:
Anti joins return the rows x that do not appear in y:
# keep the rows in GDP that do not appear in the Population df:anti_join_df <- gdp_df |>anti_join(pop_df, by =c('country', 'year'))head(anti_join_df)
# try the other version:anti_join_df_reverse <- pop_df |>anti_join(gdp_df, by =c('country', 'year'))# should have no observations heredim(anti_join_df_reverse)
[1] 0 3
Anti joins can be very useful if you want to know which rows are excluded due to mismatches in the key columns.
Checking for consistencies and inconsistencies between data sources is an important part of the data cleaning process and can often help to uncover data entry or coding errors that should be fixed prior to conducting any analyses.
Binding Rows:
# imagine we have two datasets that we want to append:gapminder1900s <- gapminder |>filter(year <2000)gapminder2000s <- gapminder |>filter(year >=2000)# bind them together via rbind:combined_gapminder <- gapminder1900s |>rbind(gapminder2000s)# look at results head(combined_gapminder)
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
summary(combined_gapminder)
country continent year lifeExp
Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
Algeria : 12 Asia :396 Median :1980 Median :60.71
Angola : 12 Europe :360 Mean :1980 Mean :59.47
Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
Australia : 12 Max. :2007 Max. :82.60
(Other) :1632
pop gdpPercap
Min. :6.001e+04 Min. : 241.2
1st Qu.:2.794e+06 1st Qu.: 1202.1
Median :7.024e+06 Median : 3531.8
Mean :2.960e+07 Mean : 7215.3
3rd Qu.:1.959e+07 3rd Qu.: 9325.5
Max. :1.319e+09 Max. :113523.1
Working through a messy exercise together:
library(dplyr)library(stringr)library(tibble)# Messy GDP datagdp_df <-tribble(~country, ~year, ~gdp_percap,"United States", 2007, 48000,"United States ", 2007, 48000, # extra spaces -> duplicate key"Brasil", 2007, 9000,"Viet Nam", 2007, 2300,"DRC", 2007, 300)# Messy life expectancy data (different key names + an extra “Atlantis” row)life_df <-tribble(~nation, ~yr, ~life_exp,"United States", 2007, 78.1,"Brazil", 2007, 72.4, # note spelling “Brazil”"Viet Nam", 2007, 74.3,"Democratic Republic of the Congo", 2007, 47.0,"Atlantis", 2007, 88.8)# A tiny crosswalk you’ll use to harmonize country names -> ISO3crosswalk <-tribble(~raw_name, ~iso3c,"United States", "USA","United States ", "USA","Brasil", "BRA","Brazil", "BRA","Viet Nam", "VNM","DRC", "COD","Democratic Republic of the Congo", "COD","Atlantis", "ATL")
Take some time to inspect keys and duplicates:
# Count potential keys in each data setgdp_dups <- gdp_df |>count(country, year, name ="n") |>filter(n >1)life_dups <- life_df |>count(nation, yr, name ="n") |>filter(n >1)gdp_dups
# A tibble: 0 × 3
# ℹ 3 variables: country <chr>, year <dbl>, n <int>
life_dups
# A tibble: 0 × 3
# ℹ 3 variables: nation <chr>, yr <dbl>, n <int>
Standardize keys:
Trim whitespace; map country names to iso3c via the crosswalk
# A tibble: 5 × 3
iso3c year gdp_percap
<chr> <dbl> <dbl>
1 USA 2007 48000
2 USA 2007 48000
3 BRA 2007 9000
4 VNM 2007 2300
5 COD 2007 300
life_clean <- life_df |>mutate(nation =str_squish(nation)) |>left_join(crosswalk, by =c("nation"="raw_name")) |>transmute(iso3c, year = yr, life_exp)head(life_clean)
# A tibble: 5 × 3
iso3c year life_exp
<chr> <dbl> <dbl>
1 USA 2007 78.1
2 BRA 2007 72.4
3 VNM 2007 74.3
4 COD 2007 47
5 ATL 2007 88.8
Ensure keys are unique before joining:
# If either of these returns rows, you’ve got many-to-many riskgdp_clean |>count(iso3c, year) |>filter(n >1)
# A tibble: 1 × 3
iso3c year n
<chr> <dbl> <int>
1 USA 2007 2
life_clean |>count(iso3c, year) |>filter(n >1)
# A tibble: 0 × 3
# ℹ 3 variables: iso3c <chr>, year <dbl>, n <int>
# Fix: for demo, deduplicate GDP by keeping one row per key (e.g., max gdp_percap)gdp_clean <- gdp_clean |>group_by(iso3c, year) |>summarize(gdp_percap =max(gdp_percap), .groups ="drop")
Join + check:
# Left join GDP <- LifeExp (keeps GDP universe)analytic <- gdp_clean |>left_join(life_clean, by =c("iso3c", "year"))# What didn’t match on the life side?unmatched_in_life <- gdp_clean |>anti_join(life_clean, by =c("iso3c","year"))# What countries appear only in life (e.g., “Atlantis”)?only_in_life <- life_clean |>anti_join(gdp_clean, by =c("iso3c","year"))list(analytic_preview =head(analytic),rows_analytic =nrow(analytic),unmatched_in_life = unmatched_in_life,only_in_life = only_in_life)
$analytic_preview
# A tibble: 4 × 4
iso3c year gdp_percap life_exp
<chr> <dbl> <dbl> <dbl>
1 BRA 2007 9000 72.4
2 COD 2007 300 47
3 USA 2007 48000 78.1
4 VNM 2007 2300 74.3
$rows_analytic
[1] 4
$unmatched_in_life
# A tibble: 0 × 3
# ℹ 3 variables: iso3c <chr>, year <dbl>, gdp_percap <dbl>
$only_in_life
# A tibble: 1 × 3
iso3c year life_exp
<chr> <dbl> <dbl>
1 ATL 2007 88.8
What We Learned Today
🔑 Keys: primary, foreign, compound
🧩 Types of joins and when to use them
🛠️ Syntax: by = c("key1", "key2")
📊 Practical joins with Gapminder data
🚦 Diagnostics: anti_join() to find mismatches, checking duplicates
➕ Appending vs. merging
🧪 Hands-on messy merge exercise
✅ You should now feel comfortable:
Performing merges in R
Auditing and debugging joins
Understanding how joins affect your dataset size/content